Moving beyond parallel data for neural machine translation
The goal of neural machine translation (NMT) is to build an end-to-end system that
automatically translates sentences from the source language to the target language.
Neural machine translation has become the dominant paradigm in machine translation
in recent years, showing strong improvements over prior statistical methods in many
scenarios. However, neural machine translation relies heavily on parallel corpora for
training; even for two languages with abundant monolingual resources (or with a large
number of speakers), such parallel corpora may be scarce. Thus, it is important to
develop methods for leveraging additional types of data in NMT training. This thesis
explores ways of augmenting the parallel training data of neural machine translation
with non-parallel sources of data. We concentrate on two main types of additional
data: monolingual corpora and structural annotations. First, we propose a method for
adding target-language monolingual data into neural machine translation in which the
monolingual data is converted to parallel data through copying. Thus, the NMT system
is trained on two tasks: translation from source language to target language, and
autoencoding the target language. We show that this model achieves improvements in
BLEU score for low- and medium-resource setups. Second, we consider the task of
zero-resource NMT, where no source ↔ target parallel training data is available, but
parallel data with a pivot language is abundant. We improve these models by adding a
monolingual corpus in the pivot language, translating this corpus into both the source
and the target language to create a pseudo-parallel source-target corpus. In the second
half of this thesis, we turn our attention to syntax, introducing methods for adding
syntactic annotation of the source language into neural machine translation. In particular,
our multi-source model, which leverages an additional encoder to inject syntax
into the NMT model, results in strong improvements over non-syntactic NMT for a
high-resource translation case, while remaining robust to unparsed inputs. We also
introduce a multi-task model that augments the transformer architecture with syntax;
this model improves translation across several language pairs. Finally, we consider
the case where no syntactic annotations are available (such as when translating from
very low-resource languages). We introduce an unsupervised hierarchical encoder that
induces a tree structure over the source sentences based solely on the downstream task
of translation. Although the resulting hierarchies do not resemble traditional syntax,
the model shows large improvements in BLEU for low-resource NMT.
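The copying method described above can be illustrated with a minimal sketch (toy data and a hypothetical helper name, not the thesis code): target-language monolingual sentences become pseudo-parallel pairs whose source side is a copy of the target side, so one training corpus covers both translation and target-language autoencoding.

```python
def make_training_pairs(parallel, monolingual_target):
    """Combine true parallel pairs with copied monolingual pairs."""
    pairs = list(parallel)  # (source, target) tuples
    for sentence in monolingual_target:
        # The target sentence is copied to the source side, so the
        # model also learns to autoencode the target language.
        pairs.append((sentence, sentence))
    return pairs

parallel = [("ein Haus", "a house"), ("ein Hund", "a dog")]
mono = ["a cat", "a tree"]
data = make_training_pairs(parallel, mono)
print(len(data))   # 4
print(data[2])     # ('a cat', 'a cat')
```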
Multi-Source Syntactic Neural Machine Translation
We introduce a novel multi-source technique for incorporating source syntax
into neural machine translation using linearized parses. This is achieved by
employing separate encoders for the sequential and parsed versions of the same
source sentence; the resulting representations are then combined using a
hierarchical attention mechanism. The proposed model improves over both seq2seq
and parsed baselines by over 1 BLEU on the WMT17 English-German task. Further
analysis shows that our multi-source syntactic model is able to translate
successfully without any parsed input, unlike standard parsed methods. In
addition, performance does not deteriorate as much on long sentences as for the
baselines. Comment: EMNLP 201
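The linearized parses fed to the second encoder can be sketched as follows (a hypothetical illustration; the actual systems consume parser-produced linearizations): a bracketed constituency parse is flattened into a token sequence in which opening nonterminals and closing brackets are ordinary tokens.

```python
import re

def linearize_parse(parse):
    """Turn a bracketed constituency parse into a flat token sequence,
    keeping opening nonterminals and closing brackets as tokens."""
    return re.findall(r"\(\S+|\)|[^\s()]+", parse)

parse = "(S (NP (DT the) (NN cat)) (VP (VBD sat)))"
print(linearize_parse(parse))
# ['(S', '(NP', '(DT', 'the', ')', '(NN', 'cat', ')', ')',
#  '(VP', '(VBD', 'sat', ')', ')', ')']
```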
Dynamic adjustment of language models for automatic speech recognition using word similarity
Out-of-vocabulary (OOV) words can pose a particular problem for automatic speech recognition (ASR) of broadcast news. The language models (LMs) of ASR systems are typically trained on static corpora, whereas new words (particularly new proper nouns) are continually introduced in the media. Additionally, such OOVs are often content-rich proper nouns that are vital to understanding the topic. In this work, we explore methods for dynamically adding OOVs to language models by adapting the n-gram language model used in our ASR system. We propose two strategies: the first relies on finding in-vocabulary (IV) words similar to the OOVs, where word embeddings are used to define similarity. Our second strategy leverages a small contemporary corpus to estimate OOV probabilities. The models we propose yield improvements in perplexity over the baseline; in addition, the corpus-based approach leads to a significant decrease in proper noun error rate over the baseline in recognition experiments.
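The embedding-similarity strategy can be sketched as follows (toy 2-d vectors and hypothetical names; the real system uses trained word embeddings and transfers the similar IV word's n-gram probabilities to the OOV):

```python
import numpy as np

def nearest_in_vocab(oov_vec, iv_embeddings):
    """Return the in-vocabulary word whose embedding is most
    cosine-similar to the OOV word's embedding."""
    best_word, best_sim = None, -1.0
    for word, vec in iv_embeddings.items():
        sim = float(np.dot(oov_vec, vec) /
                    (np.linalg.norm(oov_vec) * np.linalg.norm(vec)))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word, best_sim

# Toy 2-d embeddings (illustrative only).
iv = {"paris": np.array([0.9, 0.1]), "ran": np.array([0.1, 0.9])}
oov = np.array([0.8, 0.2])  # e.g. an unseen city name
print(nearest_in_vocab(oov, iv)[0])  # paris
```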
RAMP: Retrieval and Attribute-Marking Enhanced Prompting for Attribute-Controlled Translation
Attribute-controlled translation (ACT) is a subtask of machine translation that involves controlling stylistic or linguistic attributes (like formality and gender) of translation outputs. While ACT has garnered attention in recent years due to its usefulness in real-world applications, progress in the task is currently limited by dataset availability, since most prior approaches rely on supervised methods. To address this limitation, we propose Retrieval and Attribute-Marking enhanced Prompting (RAMP), which leverages large multilingual language models to perform ACT in few-shot and zero-shot settings. RAMP improves generation accuracy over the standard prompting approach by (1) incorporating a semantic similarity retrieval component for selecting similar in-context examples, and (2) marking in-context examples with attribute annotations. Our comprehensive experiments show that RAMP is a viable approach in both zero-shot and few-shot settings.
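The two components of RAMP can be sketched as follows (a hypothetical illustration: token-overlap similarity stands in for the semantic retriever, and the prompt format and names are invented for this sketch):

```python
def retrieve_examples(query, pool, k=2):
    """Rank candidate in-context examples by token overlap with the
    query (a stand-in for the semantic similarity retriever)."""
    def overlap(a, b):
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / max(len(sa | sb), 1)
    return sorted(pool, key=lambda ex: overlap(query, ex["src"]),
                  reverse=True)[:k]

def build_prompt(query, examples, attribute):
    """Assemble a few-shot prompt whose in-context examples carry
    explicit attribute marks."""
    lines = [f"[{attribute}: {ex['attr']}] {ex['src']} => {ex['tgt']}"
             for ex in examples]
    lines.append(f"[{attribute}: formal] {query} =>")
    return "\n".join(lines)

pool = [
    {"src": "How are you?", "tgt": "Wie geht es Ihnen?", "attr": "formal"},
    {"src": "See you later!", "tgt": "Bis später!", "attr": "informal"},
]
ex = retrieve_examples("How are you today?", pool, k=1)
print(build_prompt("How are you today?", ex, "formality"))
```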
MT-GenEval: A Counterfactual and Contextual Dataset for Evaluating Gender Accuracy in Machine Translation
As generic machine translation (MT) quality has improved, the need for
targeted benchmarks that explore fine-grained aspects of quality has increased.
In particular, gender accuracy in translation can have implications in terms of
output fluency, translation accuracy, and ethics. In this paper, we introduce
MT-GenEval, a benchmark for evaluating gender accuracy in translation from
English into eight widely-spoken languages. MT-GenEval complements existing
benchmarks by providing realistic, gender-balanced, counterfactual data in
eight language pairs where the gender of individuals is unambiguous in the
input segment, including multi-sentence segments requiring inter-sentential
gender agreement. Our data and code are publicly available under a CC BY SA 3.0
license. Comment: Accepted at EMNLP 2022. Data and code:
https://github.com/amazon-research/machine-translation-gender-eva
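A word-level accuracy check in the spirit of this counterfactual setup might look like the following (a hypothetical metric for illustration, not necessarily the benchmark's exact definition): the translation must contain the gendered words unique to the correct reference and none unique to the opposite-gender counterfactual.

```python
def gender_accurate(hypothesis, required, forbidden):
    """Hypothetical check: the output must contain every gendered word
    unique to the correct reference and no word unique to the
    counterfactual (opposite-gender) reference."""
    tokens = set(hypothesis.lower().split())
    return (all(w in tokens for w in required)
            and not any(w in tokens for w in forbidden))

# German example for the English source "She is a doctor."
hyp = "sie ist ärztin"
print(gender_accurate(hyp, required={"ärztin"}, forbidden={"arzt"}))  # True
```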
The University of Edinburgh’s Neural MT Systems for WMT17
This paper describes the University of Edinburgh's submissions to the WMT17
shared news translation and biomedical translation tasks. We participated in 12
translation directions for news, translating between English and Czech, German,
Latvian, Russian, Turkish and Chinese. For the biomedical task we submitted
systems for English to Czech, German, Polish and Romanian. Our systems are
neural machine translation systems trained with Nematus, an attentional
encoder-decoder. We follow our setup from last year and build BPE-based models
with parallel and back-translated monolingual training data. Novelties this
year include the use of deep architectures, layer normalization, and more
compact models due to weight tying and improvements in BPE segmentations. We
perform extensive ablative experiments, reporting on the effectiveness of layer
normalization, deep architectures, and different ensembling techniques. Comment: WMT 2017 shared task track; for Bibtex, see
http://homepages.inf.ed.ac.uk/rsennric/bib.html#uedin-nmt:201
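The BPE segmentation mentioned above rests on iteratively merging the most frequent adjacent symbol pair in a frequency-weighted corpus; one merge step can be sketched as follows (toy corpus and hypothetical helper names, not the Nematus implementation):

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs across a corpus of words (each word
    a tuple of symbols, mapped to its frequency) and return the most
    frequent pair; BPE merges this pair into one symbol per iteration."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Apply one merge: fuse every occurrence of `pair` into one symbol."""
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("l", "o"): 1}
pair = most_frequent_pair(corpus)   # ('l', 'o'), seen 8 times
corpus = merge_pair(corpus, pair)
print(pair, corpus)
```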